46 research outputs found

    A scalable machine-learning approach to recognize chemical names within large text databases

    MOTIVATION: The use or study of chemical compounds permeates almost every scientific field and in each of them, the amount of textual information is growing rapidly. There is a need to accurately identify chemical names within text for a number of informatics efforts such as database curation, report summarization, tagging of named entities and keywords, or the development/curation of reference databases. RESULTS: A first-order Markov Model (MM) was evaluated for its ability to distinguish chemical names from words, yielding ~93% recall in recognizing chemical terms and ~99% precision in rejecting non-chemical terms on smaller test sets. However, because total false-positive events increase with the number of words analyzed, the scalability of name recognition was measured by processing 13.1 million MEDLINE records. The method yielded precision ranging from 54.7% to 100%, depending upon the cutoff score used, and averaging 82.7% for approximately 1.05 million putative chemical terms extracted. Extracted chemical terms were analyzed to estimate the number of spelling variants per term, which correlated with the total number of times the chemical name appeared in MEDLINE. This variability in term construction was found to affect both information retrieval and term mapping when using PubMed and Ovid
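
The first-order Markov model mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the character-level transitions, the add-one smoothing, the assumed alphabet size of 40, the toy training lists, and the names `train` and `chemical_score` are all assumptions made for illustration. A word is scored by the average log-likelihood ratio between a model trained on chemical names and one trained on ordinary words; a cutoff on that score would then trade precision against recall.

```python
import math
from collections import defaultdict

ALPHABET_SIZE = 40  # assumed size of the character set, used only for smoothing


def train(words):
    """Fit a first-order (character-to-character) Markov model.

    Returns a function logp(a, b) giving the smoothed log-probability
    of character b following character a.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        padded = "^" + word.lower() + "$"  # explicit start and end markers
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1

    def logp(a, b):
        total = sum(counts[a].values())
        # Add-one smoothing so unseen transitions keep a nonzero probability.
        return math.log((counts[a][b] + 1) / (total + ALPHABET_SIZE))

    return logp


# Toy training data, purely illustrative.
CHEMICALS = ["methanol", "ethanol", "propanol", "butanol", "benzene",
             "toluene", "acetone", "cyclohexane", "chloroform", "acetamide"]
WORDS = ["the", "table", "window", "running", "house",
         "people", "market", "garden", "letter", "morning"]

chem_logp = train(CHEMICALS)
word_logp = train(WORDS)


def chemical_score(word):
    """Average log-likelihood ratio per transition; positive means 'more chemical-like'."""
    padded = "^" + word.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    return sum(chem_logp(a, b) - word_logp(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    for candidate in ["pentanol", "gardening"]:
        print(candidate, round(chemical_score(candidate), 3))
```

With realistic training data, the cutoff applied to `chemical_score` would play the role of the cutoff score the abstract refers to.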

    Some methods for blindfolded record linkage

    BACKGROUND: The linkage of records which refer to the same entity in separate data collections is a common requirement in public health and biomedical research. Traditionally, record linkage techniques have required that all the identifying data in which links are sought be revealed to at least one party, often a third party. This necessarily invades personal privacy and requires complete trust in the intentions of that party and their ability to maintain security and confidentiality. Dusserre, Quantin, Bouzelat and colleagues have demonstrated that it is possible to use secure one-way hash transformations to carry out follow-up epidemiological studies without any party having to reveal identifying information about any of the subjects – a technique which we refer to as "blindfolded record linkage". A limitation of their method is that only exact comparisons of values are possible, although phonetic encoding of names and other strings can be used to allow for some types of typographical variation and data errors. METHODS: A method is described which permits the calculation of a general similarity measure, the n-gram score, without having to reveal the data being compared, albeit at some cost in computation and data communication. This method can be combined with public key cryptography and automatic estimation of linkage model parameters to create an overall system for blindfolded record linkage. RESULTS: The system described offers good protection against misdeeds or security failures by any one party, but remains vulnerable to collusion between or simultaneous compromise of two or more parties involved in the linkage operation. In order to reduce the likelihood of this, the use of last-minute allocation of tasks to substitutable servers is proposed. Proof-of-concept computer programmes written in the Python programming language are provided to illustrate the similarity comparison protocol. 
    CONCLUSION: Although the protocols described in this paper are not unconditionally secure, they do suggest the feasibility, with the aid of modern cryptographic techniques and high speed communication networks, of a general purpose probabilistic record linkage system which permits record linkage studies to be carried out with negligible risk of invasion of personal privacy.
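
To make the n-gram score concrete, the sketch below computes a bigram Dice coefficient over keyed hashes of bigrams rather than over the bigrams themselves. It is a deliberately simplified illustration and not the protocol described in the paper: the choice of HMAC-SHA256, the shared secret `key`, and the function names `bigrams`, `blind`, and `dice` are assumptions, and comparing per-bigram hashes directly would leak more information than a real blindfolded protocol should.

```python
import hashlib
import hmac


def bigrams(text):
    """Return the set of overlapping two-character substrings of text."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def blind(text, key):
    """Replace each bigram with a keyed one-way hash (HMAC-SHA256)."""
    return {
        hmac.new(key, gram.encode("utf-8"), hashlib.sha256).hexdigest()
        for gram in bigrams(text)
    }


def dice(a, b):
    """Dice coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = b"shared-secret-known-only-to-the-data-holders"  # hypothetical key
    left = blind("peter", key)   # computed by the first data holder
    right = blind("pete", key)   # computed by the second data holder
    # Whoever compares the two sets sees only hashes, never the names.
    print(dice(left, right))
```

Because the hash is deterministic under a shared key, the similarity of the blinded sets equals the similarity of the plain bigram sets, which is what allows approximate matching without revealing the underlying strings.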

    Testing a global standard for quantifying species recovery and assessing conservation impact

    Recognizing the imperative to evaluate species recovery and conservation impact, in 2012 the International Union for Conservation of Nature (IUCN) called for development of a “Green List of Species” (now the IUCN Green Status of Species). A draft Green Status framework for assessing species’ progress toward recovery, published in 2018, proposed 2 separate but interlinked components: a standardized method (i.e., measurement against benchmarks of species’ viability, functionality, and preimpact distribution) to determine current species recovery status (herein species recovery score) and application of that method to estimate past and potential future impacts of conservation based on 4 metrics (conservation legacy, conservation dependence, conservation gain, and recovery potential). We tested the framework with 181 species representing diverse taxa, life histories, biomes, and IUCN Red List categories (extinction risk). Based on the observed distribution of species’ recovery scores, we propose the following species recovery categories: fully recovered, slightly depleted, moderately depleted, largely depleted, critically depleted, extinct in the wild, and indeterminate. Fifty-nine percent of tested species were considered largely or critically depleted. Although there was a negative relationship between extinction risk and species recovery score, variation was considerable. Some species in lower risk categories were assessed as farther from recovery than those at higher risk. This emphasizes that species recovery is conceptually different from extinction risk and reinforces the utility of the IUCN Green Status of Species to more fully understand species conservation status. Although extinction risk did not predict conservation legacy, conservation dependence, or conservation gain, it was positively correlated with recovery potential. 
    Only 1.7% of tested species were categorized as zero across all 4 of these conservation impact metrics, indicating that conservation has played, or will play, a role in improving or maintaining species status for the vast majority of these species. Based on our results, we devised an updated assessment framework that introduces the option of using a dynamic baseline to assess future impacts of conservation over the short term, to avoid the misleading results that were generated in a small number of cases, and that redefines short term as 10 years to better align with conservation planning. These changes are reflected in the IUCN Green Status of Species Standard.

    Relational Architecture
